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Abstract — This paper presents a theoretical research based ap- 
proach to ellipsis resolution in machine translation. The formula 
of discourse is applied in order to resolve ellipses. The validity 
of the discourse formula is analyzed by applying it to the real 
world text, i.e., newspaper fragments. The source text is converted 
into mono-sentential discourses where complex discourses require 
further dissection either directly into primitive discourses or 
first into compound discourses and later into primitive ones. 
The procedure of dissection needs further improvement, i.e., 
discovering as many primitive discourse forms as possible. An 
attempt has been made to investigate new primitive discourses 
or patterns from the given text. 

I. Introduction 

A text is not adequately translated until it is considered as 
part of a discourse. Discourses are linguistic units composed 
of several sentences. The term discourse is coined by Zellig 
Harris in 1952. Informally and intuitively, a discourse is a 
connected piece of text or spoken language of more than one 
sentence spoken by one or more speakers (T|, __. Discourse 
has been taken up in a variety of disciplines. However in 
our work, the term discourse is used in context of machine 
translation and a Discourse Unit (DU J] is taken as a unit of 
analysis. 

In this paper, a discourse-based approach is used to re- 
solve anaphoric and cataphoricj^mbiguities. The source text 
is converted into mono-sentential (primitive discourses) where 
complex discourses require further dissection either directly 
into primitive discourses or first into compound discourses 
and later into primitive ones. An attempt has been made to (a) 
investigate as many primitive discourses as possible and (b) 
finding ways of splitting compound and complex discourses 
each into discourses having forms which belong to the existing 
set of primitive discourse forms. The discourse formula is 
applied to the newspaper fragments. 

II. Dissection of complex and compound 

DISCOURSES INTO PRIMITIVE ONES 

We have applied discourse-based approach to several news- 
paper fragments theoretically and solved various anaphoric 
and cataphoric ambiguities. The complex and compound dis- 
courses are dissected into primitive discourses. In order to 

'DU is an atomic utterance that has no reference beyond its boundaries 
2 It finds its consequent in the subsequent text 



TABLE I 

Primitive discourses and Generalized patterns 



Primitive Discourses 


Generalized Patterns 


1- Swedes are 3,500 


A are B 


2- Swedes are still missing 


A are B C 


3- Swedes are in Thailand 


A are in B 


4- Waves were tidal 


A were B 


5- Waves struck coastline of the country 


A B C of the D 


6- Missing after a week 


A B a C 


7- Swedes are 60 


A are B 


8- Swedes confirmed dead 


ABC 


9- The foreign ministry said on Sunday 


The A B C on D 



understand the concept of this paper, consider an example from 
our experiments: 

[Around 3,500 Swedes are still missing in Thailand, a 
week after tidal waves struck the country's coastline, with 60 
Swedes confirmed dead, the foreign ministry said Sunday]. 
[The ministry said, it had managed to locate the missing 
tourists and struck their names off the list, but new names 
were being added all the time ]. 

There are two complex discourses in the above article 
enclosed by square brackets. After anaphora and cataphora 
resolution, complex discourses have to be dissected either 
into compound discourses and then later into primitive ones, 
OR Complex discourses are directly dissected into primitive 
discourses. The first complex discourse is directly dissected 
into a set of primitive discourses. Table [I] contains a set of 
primitive discourses compared with their generalized patterns 

The generalized discourses can be used to generate hundreds 
of sentences having the same pattern otherwise. For instance, 
the second discourse ,i.e., A are B C is also valid for Children 
are playing game. After investigating a substantial number of 
generalized primitive discourses formats, the system would 
match the patterns of new incoming primitive discourses 
with the generalized discourses already stored in the machine 
translation system and would reject an incomplete text ,i.e., 
syntactically incomplete. However, it is hard to believe that a 
system would reject an incomplete sentence at a reasonable 
accuracy. 

Compound and complex discourses are dissected into dis- 



TABLE II 

Primitive discourses and Generalized patterns 



534 



Primitive Discourses 


Generalized Patterns 


10- Tourists are missing 


A are B 


11- Ministry had managed 


A had B 


12- Tourists are located 


A are B 


13- Tourists are missing 


A are B 


14- Struck names off the list 


A B C the D 



TABLE III 

Primitive discourses and Generalized patterns 



Primitive Discourses 


Generalized Patterns 


15- Names are new 


A are B 


16- Tourists are missing 


A are B 


17- Names were being added throughout 


A were BCD 



courses having forms, which belong to the existing set of 
primitive discourse forms. For example, the pattern A are B 
belongs to the existing set of primitive discourses, already 
appeared as the first generalized discourse. A list called L* , 
which contains information regarding various articles, auxil- 
iary verbs, copula verbs and preposition, is used during our 
theoretical experiments. It has solved the problem of redundant 
predicates upto great extent 0. 

The second complex discourse is first converted into two 
compound discourses and then later into primitive discourses. 
After anaphora resolution, the complex discourse can be 
represented as: 

[The ministry said, ministry had managed to locate missing 
tourists and struck names of missing tourists off the list, but 
new names of missing tourists were being added throughout]. 

Now, there are two compound discourses in the post reso- 
lution stage. 

(a) Ministry had managed to locate missing tourists and 
struck names of missing tourists off the list. 

(b) But new names of missing tourists were being added all 
the time. 

Table |H| contains a set of primitive discourses resulted from 
the compound discourse (a) and Table [Til] contains a set of 
primitive discourses resulted from the compound discourse (b). 

III. Experiments and Evaluation 

The discourse formula, i.e., dissection of complex and 
compound discourses into primitive ones, is applied manually 
to 100 newspaper fragments. Our aim was to investigate as 
many primitive discourses as possible and to resolve anaphoric 
and cataphoric ambiguities during dissection. During our 
experiments we have investigated approximately 427 new 
primitive discourses having completely new patterns. More 
than 534 anaphoric/cataphoric ambiguities were resolved. Note 
that during our experiment we got considerable number of 
redundant predicates, i.e., predicates having the same pattern. 
For example, we got the pattern 'A are B' approximately 
more than 100 times, but it was considered only once. The 
results are given in Fig. [T] The figure shows that the discourse 
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Fig. 1. Dissection of complex and compound discourses into primitive ones 



formula being applied to the initial 50 newspaper fragments 
resulted into approximately 300 primitive discourses and the 
resolution of 264 anaphoric/cataphoric ambiguities. In the last 
50 fragments, the numbers of new primitive discourses have 
been decreased up to 127 primitive discourses, while the reso- 
lution of anaphoric/cataphoric ambiguities remained the same. 
We have concluded that as we go further to investigate new 
primitive discourses, the number of new primitive discourses 
decrease accordingly. 

IV. Conclusions 

A discourse-based approach in text-based machine transla- 
tion was the main focus of this paper where DU was taken 
as a unit of analysis. The source text was dissected into mul- 
tiple discourses, i.e., mono-sentential or poly-sentential. After 
various ambiguities were resolved, the complex discourses 
were dissected into mono-sentential discourses. The mono- 
sentential discourse can be easily translated into its corre- 
sponding target language, using different implementation tools 
such as PROLOG and LISP. However, choosing PROLOG 
could be the best implementation tool owing to the fact that 
the resultant primitive discourses are easily represented into 
their corresponding prolog statements, which is later used for 
translation. 

This paper presented the mechanism of how a source text 
is dissected into mono sentential discourses, for translation 
purposes. However, after achieving the translation of mono 
sentential discourses there is a need to rearrange the mono 
sentential discourses back into complex discourses of the target 
language. This could be the reverse mechanism of dissecting 
the source language. In future our aim is to improve this work 
by implementing the resultant primitive discourses using ap- 
propriate language, translate the discourses into corresponding 
target language, and then rearrange the translated primitive 
discourses into complex discourses of the target language. 
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